spatialHeatmap 2.1.3
The primary functionality of spatialHeatmap package is to visualize cell-, tissue- and organ-specific data of biological assays by coloring the corresponding spatial features (cellular compartments, tissues, organs, etc) defined in anatomical images according to a numeric color key. This core functionality of the package is called a spatial heatmap (SHM) plot. SHM has extended functionalities of spatial enrichment (SE) and clustering. SE is specialized in detecting genes that are specifically expressed in a particular spatial feature, while clustering is designed to detect biological molecule groups sharing related abundance profiles (e.g. gene modules) and visualize them in matrix heatmaps combined with hierarchical clustering dendrograms and network representations. Details of these functionalities are presented in the other vignette.
Except for SHM, SE, and clustering, spatialHeatmap has a novel functionality of integrated co-visualization of bulk tissues and single cells, and this feature is the focus in this vignette.
The co-visualization is a novel functionality in that it bridges single cells and source bulk tissues in a composite plot. Single cells are visualized in embedding plots through dimensionality reduction algorithms of PCA, UMAP, or TSNE while source bulk tissues in SHM plots. The key point in co-visualization is the matching between cell groups (clusters) and bulk tissues. To maximize flexibility, two methods are provided for the matching, i.e. manual and automatic, which are implemented in both R command line and Shiny app (Chang et al. 2021, Chang and Borges Ribeiro (2018)).
Data containers in co-visualization include data frame, SummarizedExperiment (SE), and SingleCellExperiment (SCE). The first two are detailed in the other vignette. Compared with SE, SCE functions similarly with SE except three main differences that are relevant to co-visualization. First, SCE has a label column in the colData slot for storing custom labels of cell groups in manual mathcing. Second, when using the Shiny app, single cell data in manual matching or combined single cell and bulk data in auto-matching are saved in an .rds file of SingleCellExperiment object by saveRDS. Third, when using the auto-matching in Shiny app, the true matching table across single cells, bulk tissues, and aSVG features is stored in the metadata slot of SCE.
The images in co-visualization are in the same form of annotated SVGs (aSVGs) with SHM plots, which is detained in the other vignette.
The spatialHeatmap package should be installed from an R (version \(\ge\) 3.6) session with the BiocManager::install command.
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("spatialHeatmap")
Next, the packages required for running the sample code in this vignette need to be loaded.
library(spatialHeatmap); library(SummarizedExperiment); library(scran); library(scater); library(igraph); library(SingleCellExperiment); library(BiocParallel)
The following lists the vignette(s) of this package in an HTML browser. Clicking the corresponding name will open this vignette.
browseVignettes('spatialHeatmap')
In manual matching, single cell clusters can be automatically detected or pre-defined by users. When matching with source bulk tissues, the matching relationship needs to be manualy defined. In R-command line, the matching is defined in a name list while in Shiny App it is simply dragging and dropping.
The example single cell data of mouse brain are from a oligodendrocyte heterogeneity study in mouse central nervous system (Marques et al. 2016). Before co-visualizing, single cell data are pre-processed and clustered, which is learned from Bioconductor OCSA. Since these steps are not the focus, details are not explained.
To obtain reproducible results, always start a new R session and set a fixed seed for Random Number Generator at the beginning, which is required only once in each R session.
set.seed(10)
Read the example single cell data.
sce.manual.pa <- system.file("extdata/shinyApp/example", "sce_manual_mouse.rds", package="spatialHeatmap")
sce.manual <- readRDS(sce.manual.pa)
Quality control through mitochondria and spike-in genes.
stats <- perCellQCMetrics(sce.manual, subsets=list(Mt=rowData(sce.manual)$featureType=='mito'), threshold=1)
sub.fields <- 'subsets_Mt_percent'
ercc <- 'ERCC' %in% altExpNames(sce.manual)
if (ercc) sub.fields <- c('altexps_ERCC_percent', sub.fields)
qc <- perCellQCFilters(stats, sub.fields=sub.fields, nmads=3)
# Discard unreliable cells.
colSums(as.matrix(qc))
## low_lib_size low_n_features high_subsets_Mt_percent
## 0 0 0
## discard
## 0
sce.manual <- sce.manual[, !qc$discard]
Normalization.
clusters <- quickCluster(sce.manual)
sce.manual <- computeSumFactors(sce.manual, cluster=clusters)
sce.manual <- logNormCounts(sce.manual)
Dimensionality reduction through PCA, UMAP, and TSNE.
# Variance modelling.
df.var <- modelGeneVar(sce.manual)
top.hvgs <- getTopHVGs(df.var, prop = 0.1, n = 3000)
# Dimensionality reduction.
sce.manual <- denoisePCA(sce.manual, technical=df.var, subset.row=top.hvgs)
sce.manual <- runTSNE(sce.manual, dimred="PCA")
sce.manual <- runUMAP(sce.manual, dimred = "PCA")
Clustering. Cell clusters are detected by first building a graph object then partitioning the graph, where cells are nodes in the graph.
# Build graph.
snn.gr <- buildSNNGraph(sce.manual, use.dimred="PCA")
# Partition graph to detect cell clusters.
cluster <- paste0('clus', cluster_walktrap(snn.gr)$membership)
table(cluster)
## cluster
## clus1 clus2 clus3 clus4 clus5 clus6
## 101 83 50 59 29 20
Cell cluster assignments are stored in the colData slot of SingleCellExperiment. Cell clusters/groups pre-defined by users need to be stored in the label column while cell clusters automatically detected by clustering algorithm are stored in the cluster column. If there are experimental variables such as treatments or time points, they should be stored in the expVar column.
cdat <- colData(sce.manual)
lab.lgc <- 'label' %in% make.names(colnames(cdat))
if (lab.lgc) {
cdat <- cbind(cluster=cluster, colData(sce.manual))
idx <- colnames(cdat) %in% c('cluster', 'label')
cdat <- cdat[, c(which(idx), which(!idx))]
} else cdat <- cbind(cluster=cluster, colData(sce.manual))
colnames(cdat) <- make.names(colnames(cdat))
colData(sce.manual) <- cdat; cdat[1:3, ]
## DataFrame with 3 rows and 8 columns
## cluster label age inferred.cell.type
## <character> <character> <character> <character>
## 1 clus3 hypothalamus p22 NFOL1
## 2 clus3 SN.VTA p22 NFOL2
## 3 clus4 dorsal.horn p20 NFOL2
## Sex strain expVar sizeFactor
## <character> <character> <character> <numeric>
## 1 F C57BL6 control 1.22941
## 2 pooled male and female CD1 control 1.16755
## 3 ? C57BL6 control 1.08944
Embedding plot of single cells with clusters labeled by the label column in colData.
plotUMAP(sce.manual, colour_by="label")
Figure 1: Embedding plot single cells
The cell clusters are defined by the label column in colData.
The spatial features in mouse brain aSVG are extracted. They are the bulk tissues to be matched with cell clusters.
svg.mus.brain <- system.file("extdata/shinyApp/example", "mus_musculus.brain.svg", package="spatialHeatmap")
# Spatial features to match with single cell clusters.
feature.df <- return_feature(svg.path=svg.mus.brain)
## Accessing features...
## Element "a" is removed: a4174 !
##
## Recommendation: please remove the 'transform' attribute with a 'matrix' value in the following groups by ungrouping and regrouping the respective groups in Inkscape. Otherwise, colors in spatial heatmap might be shifted!
## EFO_0000530
##
## mus_musculus.brain.svg,
feature.df$feature
## [1] "path16294" "path16312"
## [3] "path16316" "path5402"
## [5] "medulla.oblongata" "cerebral.cortex"
## [7] "corpus.striatum" "diencephalon"
## [9] "pituitary.gland" "hippocampus"
## [11] "cerebellum" "brainstem"
## [13] "midbrain" "dorsal.plus.ventral.thalamus"
## [15] "hypothalamus" "nose"
## [17] "corpora.quadrigemina"
The single cell clusters can be matched to bulk tissues according to cluster assignments in the label or cluster column in colData.
The custom cell clusters are defined in the label column of colData, which are cell sources provided in the original study.
unique(colData(sce.manual)$label)
## [1] "hypothalamus" "SN.VTA" "dorsal.horn" "cortex.S1"
## [5] "Amygdala" "corpus.callosum" "zona.incerta" "striatum"
## [9] "hippocampus.CA1" "dentate.gyrus"
Aggregate cells by clusters defined in the label column.
sce.manual.aggr <- aggr_rep(sce.manual, assay.na='logcounts', sam.factor='label', con.factor='expVar', aggr='mean')
## Syntactically valid column names are made!
Manually create the matching list.
lis.match <- list(hypothalamus=c('hypothalamus'), cortex.S1=c('cerebral.cortex'))
Co-visualization on gene St18.
shm.lis <- spatial_hm(svg.path=svg.mus.brain, data=sce.manual.aggr, ID=c('St18'), height=0.7, legend.r=1.5, legend.key.size=0.02, legend.text.size=12, legend.nrow=2, sce.dimred=sce.manual, dimred='PCA', cell.group='label', assay.na='logcounts', tar.cell=c('matched'), lis.rematch=lis.match, bar.width=0.1, dim.lgd.nrow=1)
## Coordinates: mus_musculus.brain.svg ...
## CPU cores: 1
## Element "a" is removed: a4174 !
##
## Recommendation: please remove the 'transform' attribute with a 'matrix' value in the following groups by ungrouping and regrouping the respective groups in Inkscape. Otherwise, colors in spatial heatmap might be shifted!
## EFO_0000530
##
##
## Potential error detected in these elements: 'EFO_0000530;EFO_0000530'! If they are groups, please remove the 'transform' attribute with a 'matrix' value by ungrouping and regrouping the respective groups in Inkscape. If individual paths, consider deleting them in Inkscape. Otherwise, colors in spatial heatmap might be shifted!
## Features in data not mapped: SN.VTA, dorsal.horn, cortex.S1, Amygdala, corpus.callosum, zona.incerta, striatum, hippocampus.CA1, dentate.gyrus
## ggplots/grobs: mus_musculus.brain.svg ...
## ggplot: St18, control 6h.post.stress
## Legend plot ...
## CPU cores: 1
## Converting "ggplot" to "grob" ...
## St18_control_1 St18_6h.post.stress_1
## Converting "ggplot" to "grob" ...
##
## Converting "ggplot" to "grob" ...
## dim_St18_control_1 dim_St18_6h.post.stress_1
Figure 2: Co-visualizing single cells and bulk tissues by custom cell clusters
The expression profiles of gene St18 are used.
Automatically detected cell clusters are stored in the cluster column in colData.
unique(colData(sce.manual)$cluster)
## [1] "clus3" "clus4" "clus2" "clus1" "clus5" "clus6"
Aggregate cells by clusters defined in the label column.
sce.manual.aggr <- aggr_rep(sce.manual, assay.na='logcounts', sam.factor='cluster', con.factor=NULL, aggr='mean')
## Syntactically valid column names are made!
Manually create the matching list.
lis.match <- list(clus1=c('hypothalamus'), clus3=c('cerebral.cortex', 'midbrain'))
Co-visualization on gene St18.
shm.lis <- spatial_hm(svg.path=svg.mus.brain, data=sce.manual.aggr, ID=c('St18'), height=0.7, legend.r=1.5, legend.key.size=0.02, legend.text.size=12, legend.nrow=3, sce.dimred=sce.manual, dimred='PCA', cell.group='cluster', assay.na='logcounts', tar.cell=c('matched'), lis.rematch=lis.match, bar.width=0.11, dim.lgd.nrow=1)
## Coordinates: mus_musculus.brain.svg ...
## CPU cores: 1
## Element "a" is removed: a4174 !
##
## Recommendation: please remove the 'transform' attribute with a 'matrix' value in the following groups by ungrouping and regrouping the respective groups in Inkscape. Otherwise, colors in spatial heatmap might be shifted!
## EFO_0000530
##
##
## Potential error detected in these elements: 'EFO_0000530;EFO_0000530'! If they are groups, please remove the 'transform' attribute with a 'matrix' value by ungrouping and regrouping the respective groups in Inkscape. If individual paths, consider deleting them in Inkscape. Otherwise, colors in spatial heatmap might be shifted!
## Features in data not mapped: clus3, clus4, clus2, clus1, clus5, clus6
## ggplots/grobs: mus_musculus.brain.svg ...
## ggplot: St18, con
## Legend plot ...
## CPU cores: 1
## Converting "ggplot" to "grob" ...
## St18_con_1
## Converting "ggplot" to "grob" ...
##
## Converting "ggplot" to "grob" ...
## dim_St18_con_1
Figure 3: Co-visualizing single cells and bulk tissues by automatically-detected cell clusters
The expression profiles of gene St18 are used.
Except for being manually defined, the matching between cells and bulk tissues in co-visualization can also be automatic. The automatic process is carried out by combining and co-clustering bulk and single cell data. The bulk and single cell data need to be derived from the same organ or same large tissue. If same organ, the single cell data are assayed on the whole organ. By contrast, if single cell data are from a whole tissue, the bulk tissues should be sub-tissues.
The potential applications of auto-matching include: 1) Reduce single cell RNA-seq (scRNA-seq) complexities. In conventional scRNA-seq, there is ususally a complex and laborious stage of isolating single cells. Auto-matching has the potential to avoid such processes since it only requires scRNA-seq on a whole organ; 2) Discover novel cell types. The cells with bulk tissue assignments are assumed to be major populations in the bulk, while cells without bulk assignemnts are likely to be novel cell types or cells at the bulk tissue boundaries; and 3) Estimate cellular compositions. If bulk tissues are representative of a whole organ, cellular compositions of the organ could be estimated according to their bulk tissue assignments. This application is useful in disease diagnose and treatment, since it helps to analyze each cell type's contribution to the disease.
The auto-matching includes two main functionalities. One is the auto-matching optimization and another is the co-visualization. The former is developed to optimize the workflow on trainning data sets, while the latter uses the optimized parameter settings for co-visualization.
Figure 4 is the co-clustering workflow overview. The inputs are raw count data (e.g. RNA-seq) of bulk tissues and single cells of the same organ (Figure 4.1). The single cells should come from the whole organ or at least covers the bulk tissues. The identities of each bulk tissue and each cell need to be labeled so as to evaluate the co-clustering performance. Bulk and single cell data are initially filtered at low strigency. The main difference between bulk and single cells is the sparsity in the latter. To reduce such difference, the bulk and single-cell data are combined, normalized, and then separated (Figure 4.2). After separation, the normalized bulk data are filtered to remove genes of low and constant expressions (Figure 4.3). The normalized single-cell data are also filtered to remove genes and cells having high zero-count rates. After filtered, the gene dimensionalities of single-cell data are reduced using PCA or UMAP method, and the top dimensionalities are clustered (Figure 4.4). In each cell cluster, cells having low similarities with other cells in the same cluster are filtered (Figure 4.5), and therefore the remaining clusters are more homogeneous (Figure 4.6). The filtered bulk and filtered single cells are combined and co-clustered (Figure 4.7).
The results include three types of co-clusters: 1) Two bulk tissues are clustered with cells. The source bulk is assigned to each cell according to Spearman's correlation coefficient. For example, in Figure 4.8a bulk A is assigned to cell a1 because a1 has higher similarity with A than B. Since the true source bulk of a1 is A, this assignment is TRUE. By contrast, cell b1 also has higher similarity with A than B, and A is assigned to b1, but this assignment is labeled FALSE since the true source bulk of b1 is B; 2). Only one bulk tissue is clustered with cells. This bulk is assigned to all the cells in the same co-cluster (4.8b); and 3) No bulk is included. All these cells are discarded (Figure 4.8c). Lastly, the Spearman's correlation coefficient and TRUE or FALSE assignments are used to create ROC plots and evaluate the performance (Figure 4.9).
Figure 4: Overview of coclustering
The coclustering are illustrated in 9 steps. Basically, bulk and single data are initially filtered, then combined, normalized, and separated. The normalized bulk and single cell data are filtered again. Single cells are clustered and resulting clusters are refined. Subsequently bulk and single cells are combined and co-clustered. In each co-cluster, bulk tissues are assigned to cells. The assignments are evaluated by ROC curves.
Since real optimizations have high demand on computing power and take a long time, it is demonstrated on toy data. Thus the result parameter settings may not be really optimal. The example bulk and single cell RNA-seq data are from Arabidopsis thaliana (Arabidopsis) root. Bulk tissue data comprise all the major root tissues such as epidermis, cortex, endodermis, xylem, columella, which are generated in a research on alternative splicing and lincRNA regulation (S. Li et al. 2016). The two single cell data sets are derived from the whole root, which are produced in a study of single cell Arabidopsis root atlas (Shahan et al. 2020). The identities of bulk and single cells are all labeled.
The optimization focuses on parameters of normalization methods, filtering, dimensionality reduction methods, refining homogeneous cell clusters, number of top dimensionalities in co-clustering, graph-building methods in co-clustering. The optimization is performed by running the co-clustering workflow (Figure 4) on each of the single cell data sets. The parameter settings being optimized is fixed and all settings of other parameters are varied across all possible combinations.
Each running of the workflow yields an AUC value, thus after running the workflow on all possible settings combinations one parameter settings has a set of AUC values. The AUCs are filtered according to some criteria and the remaining AUCs are averaged. A settings with a higher mean AUC than its counterparts are taken as optimal in a parameter. For example, when optimizing dimensionality reduction methods, the settings are PCA and UMAP. If the mean of remaining AUCs of PCA is 0.6 while UMAP is 0.55, PCA is regarded as the optimal.
Since optimzation on example data also takes a relatively long time, most of the following steps are not evaluated. A common computer with 4G memory and 4 CPUs is enough to run the following optimization process.
To obtain reproducible results, always start a new R session and set a fixed seed for Random Number Generator at the beginning, which is required only once in each R session.
set.seed(10)
Read bulk and two single cell data.
blk <- readRDS(system.file("extdata/cocluster/data", "bulk_cocluster.rds", package="spatialHeatmap")) # Bulk.
sc10 <- readRDS(system.file("extdata/cocluster/data", "sc10_cocluster.rds", package="spatialHeatmap")) # Single cell.
sc11 <- readRDS(system.file("extdata/cocluster/data", "sc11_cocluster.rds", package="spatialHeatmap")) # Single cell.
blk; sc10; sc11
## class: SummarizedExperiment
## dim: 2805 45
## metadata(0):
## assays(1): counts
## rownames(2805): AT1G01070 AT1G01120 ... ATCG00650 ATCG00770
## rowData names(0):
## colnames(45): PHLM_COMP PHLM_COMP ... HAIR CORT
## colData names(0):
## class: SingleCellExperiment
## dim: 577 1893
## metadata(0):
## assays(1): counts
## rownames(577): AT1G01470 AT1G02400 ... AT5G66870 AT5G67080
## rowData names(0):
## colnames(1893): atricho endo ... atricho cortex
## colData names(0):
## reducedDimNames(0):
## mainExpName: NULL
## altExpNames(0):
## class: SingleCellExperiment
## dim: 577 2364
## metadata(0):
## assays(1): counts
## rownames(577): AT1G01470 AT1G02400 ... AT5G66870 AT5G67080
## rowData names(0):
## colnames(2364): atricho tricho ... endo atricho
## colData names(0):
## reducedDimNames(0):
## mainExpName: NULL
## altExpNames(0):
These example data are already pre-processed. To demonstrate the optimization process the pre-processing steps are perfomed again with few genes or cells removed.
Inital filtering with low strigency before normalization.
blk <- filter_data(data=blk, pOA=c(0.2, 15), CV=c(1.5, 100))
fil.init <- filter_cell(lis=list(sc10=sc10, sc11=sc11), bulk=blk, gen.rm='^ATCG|^ATCG', min.cnt=1, p.in.cell=0.3, p.in.gen=0.1); fil.init
Combine and normalize bulk and single cell data, then separate them. By default computeSumFactors (fct) in scran package is used (Lun, McCarthy, and Marioni 2016). If cpm=TRUE, additional normalization of counts per million is applied.
norm.fct <- norm_multi(dat.lis=fil.init, cpm=FALSE) # fct.
norm.cpm <- norm_multi(dat.lis=fil.init, cpm=TRUE) # fct + cpm
Secondary filtering with higher strigency after normalization. Four sets of filtering parameter settings are created. In bulk data, genes with expression values over A across samples of over proportion p and with coefficinet of variance (CV) between cv1 and cv2 are retained. In cell data, genes with expression values over min.cnt of at least proportion p.in.gen are retained, and cells with with expression values over min.cnt of at least proportion p.in.cell are retained.
df.par.fil <- data.frame(p=c(0.1, 0.2, 0.3, 0.4), A=rep(1, 4), cv1=c(0.1, 0.2, 0.3, 0.4), cv2=rep(100, 4), min.cnt=rep(1, 4), p.in.cell=c(0.1, 0.25, 0.3, 0.35), p.in.gen=c(0.01, 0.05, 0.1, 0.15))
df.par.fil
## p A cv1 cv2 min.cnt p.in.cell p.in.gen
## 1 0.1 1 0.1 100 1 0.10 0.01
## 2 0.2 1 0.2 100 1 0.25 0.05
## 3 0.3 1 0.3 100 1 0.30 0.10
## 4 0.4 1 0.4 100 1 0.35 0.15
Filter bulk and cell data using the four filtering settings. The results are automatically saved in the working directory wk.dir and are recognized in the downstream. Thus the working directory should be the same across the entire workflow.
if (!dir.exists('opt_res')) dir.create('opt_res')
fct.fil.all <- filter_iter(bulk=norm.fct$bulk, cell.lis=list(sc10=norm.fct$sc10, sc11=norm.fct$sc11), df.par.fil=df.par.fil, gen.rm='^ATCG|^ATCG', wk.dir='opt_res', norm.meth='fct')
cpm.fil.all <- filter_iter(bulk=norm.cpm$bulk, cell.lis=list(sc10=norm.cpm$sc10, sc11=norm.cpm$sc11), df.par.fil=df.par.fil, gen.rm='^ATCG|^ATCG', wk.dir='opt_res', norm.meth='cpm')
To evaluate the downstream auto-matching performance, a ground-truth matching relationship is required in form of data.frame. The cell and trueBulk refer to bulk tissue identifiers in aSVG files, single cell identifiers and bulk tissue identifiers in the data for co-clustering, respectively. If a cell matches multiple bulk tissues, bulk identifiers are separated by comma, semicolon, or single space such as NONHAIR,LRC_NONHAIR. The SVGBulk is the bulk identifiers in aSVG files, which are recognized in co-visualization.
match.pa <- system.file("extdata/cocluster/data", "match_arab_root_cocluster.txt", package="spatialHeatmap")
df.match.arab <- read.table(match.pa, header=TRUE, row.names=1, sep='\t')
df.match.arab[1:3, ]
## SVGBulk cell trueBulk
## 1 NONHAIR atricho NONHAIR,LRC_NONHAIR
## 2 COLU colu.dist.colu COLU
## 3 COLU colu.dist.lrc COLU
In real application, the whole optimization takes a long time and requires a lot of computation power. For example, combined bulk and cell data with 6945 genes and 7747 samples requires about 20G memory for coclustering. To speed up computation, the optimization function coclus_opt provides two levels of parallel computing through BiocParallel (Morgan et al. 2021). The first one is BatchtoolsParam and relies on a cluster scheduler such as the SLURM scheduler and the second one utilizes MulticoreParam.
Before optimzation, users could check the parallelization guide by setting parallel.info=TRUE, then it returns the max possible parallelizations for each level respectively.
coclus_opt(wk.dir='opt_res', parallel.info=TRUE, dimred=c('PCA', 'UMAP'), graph.meth=c('knn', 'snn'), sim=seq(0.2, 0.4, by=0.1), sim.p=seq(0.2, 0.4, by=0.1), dim=seq(5, 7, by=1))
A SLURM template is provided as an example for the first level parallelization. Users are advised to make a new copy and set SLURM parameters in the new copy. If users have access to other cluster schedulers, the template should be provided accordingly.
file.copy(system.file("extdata/cocluster", "slurm.tmpl", package="spatialHeatmap"), './slurm.tmpl')
Below is the demonstration of two-level parallelization on SLURM. For instance, the first- and second-level parallelizations are set 3 and 2 cpu cores respectively. The wk.dir is the same in secondary filtering.
sim and sim.p are parameters in refining cell clusters (Figure 4.5). Specifically, in a cell cluster, cells having similarities over sim with other cells in the same cluster of at least proportion sim.p would remain. sim is Spearman' or Pearson's correlation coefficient. dim is the number of top dimensionalities (equivalent to genes) in co-clustering. Since the three parameters are related to each other, they are treated as a set spd.set.
opt <- coclus_opt(wk.dir='opt_res', dimred=c('PCA', 'UMAP'), graph.meth=c('knn', 'snn'), sim=seq(0.2, 0.4, by=0.1), sim.p=seq(0.2, 0.4, by=0.1), dim=seq(5, 7, by=1), df.match=df.match.arab, batch.par=BatchtoolsParam(workers=3, cluster="slurm", template='slurm.tmpl', RNGseed=100, stop.on.error = FALSE, log =TRUE, logdir=file.path('opt_res', 'batch_log')), multi.core.par=MulticoreParam(workers=2), verbose=FALSE)
If no cluster scheduler is available, optimization can be parallelized only at the second-level by setting batch.par=NULL.
opt <- coclus_opt(wk.dir='opt_res', dimred=c('PCA', 'UMAP'), graph.meth=c('knn', 'snn'), sim=seq(0.2, 0.4, by=0.1), sim.p=seq(0.2, 0.4, by=0.1), dim=seq(5, 7, by=1), df.match=df.match.arab, batch.par=NULL, multi.core.par=MulticoreParam(workers=2))
The performace of each combination of parameter settings on each single cell data set is measured by an AUC value in ROC curve. These AUCs are filtered according to a cutoff (aucs over 0.5) and corresponding total bulk assignments (total.min) and total true assignments (true.min). The following demonstrates how to visualize the AUCs and select optimal parameter settings.
Extract AUCs for each filtering settings across aucs at 0.5, 0.6, 0.7, 0.8, 0.9 respectively.
df.lis.fil <- auc_stat(wk.dir='opt_res', tar.par='filter', total.min=500, true.min=300, aucs=round(seq(0.5, 0.9, 0.1), 1))
Mean AUCs by each filtering settings and AUC cutoff.
df.lis.fil$df.auc.mean[1:3, ]
mean_auc_bar(df.lis.fil[[1]], bar.width=0.07, title='Mean AUCs by filtering settings', x.text.size=15, y.text.size=15, lgd.text.size=15)
Figure 5: Mean AUCs of filtering settings
One bar refers to mean AUCs of a filtering settings at a certain AUC cutoff.
All AUCs by each filtering settings and AUC cutoff.
auc_violin(df.lis=df.lis.fil, xlab='Filtering settings', x.text.size=13, y.text.size=13)
Figure 6: All AUCs of filtering settings
A violin plot refers to all AUCs of a filtering settings at a certain AUC cutoff.
According to the mean AUCs, optimal filtering settings are fil1, fil2, fil3.
df.par.fil[c(1, 2, 3), ]
## p A cv1 cv2 min.cnt p.in.cell p.in.gen
## 1 0.1 1 0.1 100 1 0.10 0.01
## 2 0.2 1 0.2 100 1 0.25 0.05
## 3 0.3 1 0.3 100 1 0.30 0.10
Extract AUCs for normalization methods across aucs at 0.5, 0.6, 0.7, 0.8, 0.9 respectively.
df.lis.norm <- auc_stat(wk.dir='opt_res', tar.par='norm', total.min=500, true.min=300, aucs=round(seq(0.5, 0.9, 0.1), 1))
Mean AUCs by each normalization method and AUC cutoff.
df.lis.norm$df.auc.mean[1:3, ]
mean_auc_bar(df.lis.norm[[1]], bar.width=0.07, title='Mean AUCs by normalization methods', x.text.size=15, y.text.size=15, lgd.text.size=15)
All AUCs by each normalization method and AUC cutoff.
auc_violin(df.lis=df.lis.norm, xlab='Normalization methods', x.text.size=13, y.text.size=13)
Optimal normalization method: fct (computeSumFactors).
Extract AUCs for graph-building methods across aucs at 0.5, 0.6, 0.7, 0.8, 0.9 respectively.
df.lis.graph <- auc_stat(wk.dir='opt_res', tar.par='graph', total.min=500, true.min=300, aucs=round(seq(0.5, 0.9, 0.1), 1))
Mean AUCs by each graph-building method and AUC cutoff.
df.lis.graph$df.auc.mean[1:3, ]
mean_auc_bar(df.lis.graph[[1]], bar.width=0.07, title='Mean AUCs by graph-building methods')
All AUCs by each graph-building method and AUC cutoff.
auc_violin(df.lis=df.lis.graph, xlab='Graph-building methods')
Optimal graph-building methods: knn (buildKNNGraph).
Extract AUCs for dimensionality reduction methods across aucs at 0.5, 0.6, 0.7, 0.8, 0.9 respectively.
df.lis.dimred <- auc_stat(wk.dir='opt_res', tar.par='dimred', total.min=500, true.min=300, aucs=round(seq(0.5, 0.9, 0.1), 1))
Mean AUCs by each dimensionality reduction method and AUC cutoff.
df.lis.dimred$df.auc.mean[1:3, ]
# Mean AUCs by each dimensionality reduction method and AUC cutoff.
mean_auc_bar(df.lis.dimred[[1]], bar.width=0.07, title='Mean AUCs by dimensionality reduction methods')
All AUCs by each dimensionality reduction method and AUC cutoff.
auc_violin(df.lis=df.lis.dimred, xlab='Dimensionality reduction')
Optimal dimensionality reduction method: pca (denoisePCA).
Extract AUCs for spd.set across aucs at 0.5, 0.6, 0.7, 0.8, 0.9 respectively.
df.lis.spd <- auc_stat(wk.dir='opt_res', tar.par='spd.set', total.min=500, true.min=300, aucs=round(seq(0.5, 0.9, 0.1), 1))
df.lis.spd$auc0.5$df.frq[1:3, ]
All AUCs of top five spd.sets ranked by frequency across aucs at 0.5, 0.6, 0.7, 0.8, 0.9 respectively.
spd_auc_violin(df.lis=df.lis.spd, n=5, xlab='spd.sets', x.vjust=0.6)
Top five spd.sets across aucs at 0.5, 0.6, 0.7, 0.8, 0.9 respectively are taken as optimal spd.sets.
n <- 5; df.spd.opt <- NULL
for (i in df.lis.spd) {
df.spd.opt <- rbind(df.spd.opt, i$df.frq[seq_len(n), c('sim', 'sim.p', 'dim')])
}
df.spd.opt$spd.set <- paste0('s', df.spd.opt$sim, 'p', df.spd.opt$sim.p, 'd', df.spd.opt$dim)
df.spd.opt <- subset(df.spd.opt, !duplicated(spd.set))
df.spd.opt[1:3, ]
In real application, the optimized settings need to be validated on data sets from other organs of different species, which is presented below.
Ideally, the co-clustering should be optimized on different organs from different organisms as many possible. The single cell data need to be generated on whole organs and each cell's identity need to be labeled. Such data are less common and not easy to obtain in public databases, since most single cell RNA-seq (scRNA-seq) assays only focus on specific cell populations rather than whole organs, which are isolated by microdissection or fluorescent assisted cell sorting (FACS). As a result, the co-clustering optimization is performed only on five single cell data sets of Arabidopsis thaliana (Arabidopsis) root. The optimized parameter settings are validated on mouse brain and kidney.
The optimization in real case has high demand on computing power and takes a long time, so most of the following steps are not evaluated. The following steps are not explained in details since they are the same as last section.
The bulk (S. Li et al. 2016) and five single cell (Shahan et al. 2020) data sets of Arabidopsis root are accessed from the same studies as last section. Details about how to access and format them are described here. In the following, blk.arb.rt refers to bulk data and sc.arab.rt10, sc.arab.rt11, sc.arab.rt12, sc.arab.rt30, sc.arab.rt31 refers to the five single cell data sets respectively.
To obtain reproducible results, always start a new R session and set a fixed seed for Random Number Generator at the beginning, which is required only once in each R session.
set.seed(10)
Inital filtering with low strigency before normalization.
blk.arab.rt <- filter_data(data=blk.arab.rt, pOA=c(0.05, 5), CV=c(0.05, 100))
fil.init <- filter_cell(lis=list(sc10=sc.arab.rt10, sc11=sc.arab.rt11, sc12=sc.arab.rt12, sc30=sc.arab.rt30, sc31=sc.arab.rt31), bulk=blk, gen.rm='^ATCG|^ATCG', min.cnt=1, p.in.cell=0.01, p.in.gen=0.05); fil.init
Combine and normalize bulk and single cell data, then separate them.
norm.fct <- norm_multi(dat.lis=fil.init, cpm=FALSE) # fct.
norm.cpm <- norm_multi(dat.lis=fil.init, cpm=TRUE) # fct + cpm
Secondary filtering with higher strigency after normalization. Four sets of filtering parameter settings are created.
df.par.fil <- data.frame(p=c(0.1, 0.2, 0.3, 0.4), A=rep(1, 4), cv1=c(0.1, 0.2, 0.3, 0.4), cv2=rep(100, 4), min.cnt=rep(1, 4), p.in.cell=c(0.1, 0.25, 0.3, 0.35), p.in.gen=c(0.01, 0.05, 0.1, 0.15))
df.par.fil
## p A cv1 cv2 min.cnt p.in.cell p.in.gen
## 1 0.1 1 0.1 100 1 0.10 0.01
## 2 0.2 1 0.2 100 1 0.25 0.05
## 3 0.3 1 0.3 100 1 0.30 0.10
## 4 0.4 1 0.4 100 1 0.35 0.15
Filter bulk and cell data using the four filtering settings. The results are automatically saved in the working directory wk.dir and are recognized in the downstream. Thus the working directory should be the same across the entire workflow.
if (!dir.exists('opt_real_res')) dir.create('opt_real_res')
fct.fil.all <- filter_iter(bulk=norm.fct$bulk, cell.lis=list(sc10=norm.fct$sc10, sc11=norm.fct$sc11, sc12=norm.fct$sc12, sc30=norm.fct$sc30, sc31=norm.fct$sc31), df.par.fil=df.par.fil, gen.rm='^ATCG|^ATCG', wk.dir='opt_real_res', norm.meth='fct')
cpm.fil.all <- filter_iter(bulk=norm.cpm$bulk, cell.lis=list(sc10=norm.cpm$sc10, sc11=norm.cpm$sc11, sc12=norm.cpm$sc12, sc30=norm.cpm$sc30, sc31=norm.cpm$sc31), df.par.fil=df.par.fil, gen.rm='^ATCG|^ATCG', wk.dir='opt_real_res', norm.meth='cpm')
Ground-truth matching relationship across cell, trueBulk, and SVGBulk.
match.pa <- system.file("extdata/cocluster/data", "match_arab_root_cocluster.txt", package="spatialHeatmap")
df.match.arab <- read.table(match.pa, header=TRUE, row.names=1, sep='\t')
df.match.arab[1:3, ]
## SVGBulk cell trueBulk
## 1 NONHAIR atricho NONHAIR,LRC_NONHAIR
## 2 COLU colu.dist.colu COLU
## 3 COLU colu.dist.lrc COLU
The max possible parallelizations for each level respectively.
coclus_opt(wk.dir='opt_real_res', parallel.info=TRUE, dimred=c('PCA', 'UMAP'), graph.meth=c('knn', 'snn'), sim=seq(0.2, 0.8, by=0.1), sim.p=seq(0.2, 0.8, by=0.1), dim=seq(5, 40, by=1))
Take the SLURM scheduler as an example for two-level parallelizatio. Make a new copy of the default SLURM template and set parameters in the new copy.
file.copy(system.file("extdata/cocluster", "slurm.tmpl", package="spatialHeatmap"), './slurm.tmpl')
The first- and second-level parallelizations are set 3 and 2 cpu cores respectively. The wk.dir is the same in secondary filtering. Note the settings of spd.set (sim/sim.p/dim) has wider ranges than in last section. The parallel computation is performed at High-Performance Computing Center (HPCC) at University of California, Riverside.
opt <- coclus_opt(wk.dir='opt_real_res', dimred=c('PCA', 'UMAP'), graph.meth=c('knn', 'snn'), sim=seq(0.2, 0.8, by=0.1), sim.p=seq(0.2, 0.8, by=0.1), dim=seq(5, 40, by=1), df.match=df.match.arab, batch.par=BatchtoolsParam(workers=3, cluster="slurm", template='slurm.tmpl', RNGseed=100, stop.on.error=FALSE, log=TRUE, logdir=file.path('opt_res', 'batch_log')), multi.core.par=MulticoreParam(workers=2), verbose=FALSE)
If no cluster scheduler is not available, optimization can be parallelized only at the second-level by setting batch.par=NULL.
opt <- coclus_opt(wk.dir='opt_real_res', dimred=c('PCA', 'UMAP'), graph.meth=c('knn', 'snn'), sim=seq(0.2, 0.8, by=0.1), sim.p=seq(0.2, 0.8, by=0.1), dim=seq(5, 40, by=1), df.match=df.match.arab, batch.par=NULL, multi.core.par=MulticoreParam(workers=2))
The performace of each combination of parameter settings on each single cell data set is measured by an AUC value in ROC curve. These AUCs are filtered according to a cutoff (aucs over 0.5) and corresponding total bulk assignments (total.min) and total true assignments (true.min). The following demonstrates how to visualize the AUCs and select optimal parameter settings.
Extract AUCs for each filtering settings across aucs at 0.5, 0.6, 0.7, 0.8, 0.9 respectively.
df.lis.fil <- auc_stat(wk.dir='opt_real_res', tar.par='filter', total.min=500, true.min=300, aucs=round(seq(0.5, 0.9, 0.1), 1))
Mean AUCs by each filtering settings and AUC cutoff.
df.lis.fil$df.auc.mean[1:3, ]
mean_auc_bar(df.lis.fil[[1]], bar.width=0.07, title='Mean AUCs by filtering settings')
Figure 7: Mean AUCs of filtering settings in real optimization
A bar refers to mean AUCs of a filtering settings at a certain AUC cutoff.
All AUCs by each filtering settings and AUC cutoff.
auc_violin(df.lis=df.lis.fil, xlab='Filtering settings')
Figure 8: All AUCs of filtering settings in real optimization
A violin plot refers to all AUCs of a filtering settings at a AUC cutoff.
According to the mean AUCs, fil1, fil2, and fil3 are selected as optimal filtering settings.
df.par.fil[c(1, 2, 3), ]
## p A cv1 cv2 min.cnt p.in.cell p.in.gen
## 1 0.1 1 0.1 100 1 0.10 0.01
## 2 0.2 1 0.2 100 1 0.25 0.05
## 3 0.3 1 0.3 100 1 0.30 0.10
Extract AUCs for normalization methods across aucs at 0.5, 0.6, 0.7, 0.8, 0.9 respectively.
df.lis.norm <- auc_stat(wk.dir='opt_real_res', tar.par='norm', total.min=500, true.min=300, aucs=round(seq(0.5, 0.9, 0.1), 1))
Mean AUCs by each normalization method and AUC cutoff.
df.lis.norm$df.auc.mean[1:3, ]
mean_auc_bar(df.lis.norm[[1]], bar.width=0.07, title='Mean AUCs by normalization methods')
All AUCs by each normalization method and AUC cutoff.
auc_violin(df.lis=df.lis.norm, xlab='Normalization methods')
Optimal normalization method: fct (computeSumFactors).
Extract AUCs for graph-building methods across aucs at 0.5, 0.6, 0.7, 0.8, 0.9 respectively.
df.lis.graph <- auc_stat(wk.dir='opt_real_res', tar.par='graph', total.min=500, true.min=300, aucs=round(seq(0.5, 0.9, 0.1), 1))
Mean AUCs by each graph-building method and AUC cutoff.
df.lis.graph$df.auc.mean[1:3, ]
mean_auc_bar(df.lis.graph[[1]], bar.width=0.07, title='Mean AUCs by graph-building methods')
All AUCs by each graph-building method and AUC cutoff.
auc_violin(df.lis=df.lis.graph, xlab='Graph-building methods')
Since knn (buildKNNGraph) and snn (buildSNNGraph) have similar mean AUCs, both are selected as optimal graph-building methods.
Extract AUCs for dimensionality reduction methods across aucs at 0.5, 0.6, 0.7, 0.8, 0.9 respectively.
df.lis.dimred <- auc_stat(wk.dir='opt_real_res', tar.par='dimred', total.min=500, true.min=300, aucs=round(seq(0.5, 0.9, 0.1), 1))
Mean AUCs by each dimensionality reduction method and AUC cutoff.
df.lis.dimred$df.auc.mean[1:3, ]
# Mean AUCs by each dimensionality reduction method and AUC cutoff.
mean_auc_bar(df.lis.dimred[[1]], bar.width=0.07, title='Mean AUCs by dimensionality reduction methods')
All AUCs by each dimensionality reduction method and AUC cutoff.
auc_violin(df.lis=df.lis.dimred, xlab='Dimensionality reduction')
Optimal dimensionality reduction method: pca (denoisePCA).
Extract AUCs for spd.set across aucs at 0.5, 0.6, 0.7, 0.8, 0.9 respectively.
df.lis.spd <- auc_stat(wk.dir='opt_real_res', tar.par='spd.set', total.min=500, true.min=300, aucs=round(seq(0.5, 0.9, 0.1), 1))
df.lis.spd$auc0.5$df.frq[1:3, ]
All AUCs of top five spd.sets ranked by frequency across aucs at 0.5, 0.6, 0.7, 0.8, 0.9 respectively.
spd_auc_violin(df.lis=df.lis.spd, n=5, xlab='spd.sets', x.vjust=0.6)
Top five spd.sets across aucs at 0.5, 0.6, 0.7, 0.8, 0.9 respectively are taken as optimal spd.sets. s, p, d stands for sim, sim.p, dim respectively. E.g. s0.2p0.5d12 means sim = 0.2, sim.p = 0.5, dim = 12.
n <- 5; df.spd.opt <- NULL
for (i in df.lis.spd) {
df.spd.opt <- rbind(df.spd.opt, i$df.frq[seq_len(n), c('sim', 'sim.p', 'dim')])
}
df.spd.opt$spd.set <- paste0('s', df.spd.opt$sim, 'p', df.spd.opt$sim.p, 'd', df.spd.opt$dim)
df.spd.opt <- subset(df.spd.opt, !duplicated(spd.set))
df.spd.opt[1:3, ]
## sim sim.p dim spd.set
## 1 0.2 0.2 5 s0.2p0.2d5
## 22 0.2 0.5 5 s0.2p0.5d5
## 57 0.2 0.3 6 s0.2p0.3d6
The optimal parameter settings at this stage are listed in the table below.
| normalization | filtering.set | dimensionality.reduction | graph.building | spd.set |
|---|---|---|---|---|
| fct | fil1, fil2, fil3 | pca | knn, snn | s0.2p0.2d5 |
| fct | fil1, fil2, fil3 | pca | knn, snn | s0.2p0.5d5 |
| fct | fil1, fil2, fil3 | pca | knn, snn | s0.2p0.3d6 |
| fct | fil1, fil2, fil3 | pca | knn, snn | s0.2p0.6d6 |
| fct | fil1, fil2, fil3 | pca | knn, snn | s0.3p0.4d6 |
| fct | fil1, fil2, fil3 | pca | knn, snn | s0.2p0.5d7 |
| fct | fil1, fil2, fil3 | pca | knn, snn | s0.2p0.2d6 |
| fct | fil1, fil2, fil3 | pca | knn, snn | s0.2p0.3d7 |
| fct | fil1, fil2, fil3 | pca | knn, snn | s0.4p0.2d6 |
| fct | fil1, fil2, fil3 | pca | knn, snn | s0.3p0.3d7 |
| fct | fil1, fil2, fil3 | pca | knn, snn | s0.3p0.7d7 |
| fct | fil1, fil2, fil3 | pca | knn, snn | s0.2p0.8d13 |
| fct | fil1, fil2, fil3 | pca | knn, snn | s0.3p0.8d10 |
| fct | fil1, fil2, fil3 | pca | knn, snn | s0.3p0.5d16 |
| fct | fil1, fil2, fil3 | pca | knn, snn | s0.3p0.4d14 |
| fct | fil1, fil2, fil3 | pca | knn, snn | s0.4p0.2d13 |
| fct | fil1, fil2, fil3 | pca | knn, snn | s0.3p0.4d12 |
| fct | fil1, fil2, fil3 | pca | knn, snn | s0.3p0.4d13 |
| fct | fil1, fil2, fil3 | pca | knn, snn | s0.3p0.2d17 |
| fct | fil1, fil2, fil3 | pca | knn, snn | s0.3p0.7d17 |
| fct | fil1, fil2, fil3 | pca | knn, snn | s0.4p0.2d12 |
| fct | fil1, fil2, fil3 | pca | knn, snn | s0.3p0.7d16 |
| fct | fil1, fil2, fil3 | pca | knn, snn | s0.2p0.2d23 |
Next, these optimal settings are validated on mouse brain, mouse kindney, and Arabidopsis root data sets. In mouse brain, the bulk RNA-seq data are generated in a research on the impact of placental endocrine on mouse cerebellar development (Vacher et al. 2021) and the scRNA-seq data are from a study of mouse brain molecular atlas (Ortiz et al. 2020). The bulk count data are produced using systemPipeR (2.1.12) (Backman and Girke 2016). Details about how to access and format bulk and single data are described here. In the following, blk.mus.brain and sc.mus.brain refers to bulk and single cell data respectively. The validation is performed by applying these optimal settings on the same coclustering workflow, so the following procedures are not detailed.
Initial filtering.
blk.mus.brain <- filter_data(data=blk.mus.brain, pOA=c(0.05, 5), CV=c(0.05, 100))
mus.brain.lis <- filter_cell(lis=list(sc.mus=sc.mus.brain), bulk=blk.mus.brain, gen.rm=NULL, min.cnt=1, p.in.cell=0.01, p.in.gen=0.05, verbose=FALSE)
Bulk and single cell are combined and normalized, then separated.
mus.brain.lis.nor <- norm_multi(dat.lis=mus.brain.lis, cpm=FALSE)
Secondary filtering. Since fil1 and fil2 exhibit similar performaces, only fil1 is used.
blk.mus.brain.fil <- filter_data(data=mus.brain.lis.nor$bulk, pOA=c(0.1, 1), CV=c(0.1, 100), verbose=FALSE)
mus.brain.lis.fil <- filter_cell(lis=list(sc.mus=mus.brain.lis.nor$sc.mus), bulk=blk.mus.brain.fil, gen.rm=NULL, min.cnt=1, p.in.cell=0.1, p.in.gen=0.01, verbose=FALSE)
Matching table indicating true bulk tissues of each cell type and corresponding SVG bulk (spatial feature).
match.mus.brain.pa <- system.file("extdata/shinyApp/example", "match_mouse_brain_cocluster.txt", package="spatialHeatmap")
df.match.mus.brain <- read.table(match.mus.brain.pa, header=TRUE, row.names=1, sep='\t')
df.match.mus.brain
## SVGBulk cell trueBulk
## 1 cerebellum cere CERE
## 2 hippocampus hipp HIPP
## 3 cerebral.cortex isocort CERE.CORTEX
## 4 hippocampus retrohipp HIPP
## 5 hypothalamus hypotha HYPOTHA
Since knn and snn display similar performances, only knn is used. All optimal spd.set settings in Table 1 are tested, and results are shown in Figure 9a.
mus.brain.df.para <- cocluster(bulk=mus.brain.lis.fil$bulk, cell=mus.brain.lis.fil$sc.mus, df.match=df.match.mus.brain, df.para=df.spd.opt[, c('sim', 'sim.p', 'dim')], graph.meth='knn', dimred='PCA', return.all=FALSE, multi.core.par=MulticoreParam(workers=2))
In mouse kidney, four bulk tissues are selected: proximal straight tubule in cortical medullary rays (PTS2), cortical collecting duct (CCD), and cortical thick ascending limb of the loop of Henle (cTAL), glomerulus. PTS2 data are from a research on cell-type selective markers in mouse kidney (Clark et al. 2019), CCD and cTAL are from transcriptome analysis of major renal collecting duct cell types in mouse kidney (Chen et al. 2017), and glomerulus is from a transcriptome atlas study of mouse glomerulus (Karaiskos et al. 2018). The FASTQ files of the four tissues are downloaded from original studies and raw count data are generated with systemPipeR (2.1.12) (Backman and Girke 2016). The single cell data are accessed from an investigation in cellular targets of mouse kidney metabolic acidosis (Park et al. 2018). Details about how to access and format bulk and single data are described here.
The validating procedures on mouse kindey are same with mouse brain except that after initial filtering replicates in each bulk are reduced to 3 by using function reduce_rep due to two many replicates. The results are shown in Figure 9b.
In Arabidopsis root, the same bulk tissues (S. Li et al. 2016) and two additional single cell data sets (sc9, sc51, (Shahan et al. 2020)) from the same studies as real optimization are used. Details about how to access and format them are described here. The procedures of validating optimized settings are the same with mouse brain except that in normalization two single cell data sets are used instead of one. The results are shown in Figure 9c and d.
As comparisons, random combinations of non-optimal settings are generated and tested. In filtering, fil4 is regarded non-optimal, but it filters out too many genes so that the coclustering procedures cannot run. Thus fil3 is used in the random settings. The graph-building methods have two settings knn and snn, and both are taken as optimal, thus they are all used for generating random combinations. See details here.
df.par.rdn <- random_para(fil.set=c('fil3'), norm='cpm', dimred='UMAP', graph.meth=c('knn', 'snn'), sim=round(seq(0.2, 0.8, by=0.1), 1), sim.p=round(seq(0.2, 0.8,by=0.1), 1), dim=seq(5, 40, by=1), df.spd.opt=df.spd.opt)
df.par.rdn[1:3, ]
These random settings are tested on each of the four validating data sets, where other settings such as initial filtering are not changed. The results are shown in Figure 10c and d.
The AUCs of optimal and random settings are presented in Figure 9 and Figure 10 respectively. In both figures, if total bulk assignments < 500 or total true assignments < 300 or AUC < 0.5, AUCs are set 0. It is clear that the optimal settings exhibit better performance than random settings, so the optimization workflow and results are reliable to some extent. In Figure 9, asterisks indicate optimal settings have AUCs >= 0.5, total bulk assignments >= 500, and total true assignments >= 300 across all four data sets. These settings are regarded as final optimal settings (Table 2).
Figure 9: Validating optimal settings
AUCs of optimal settings on each validating data sets. A bar refers to one AUC of one optimal settings. Asterisks indicate optimal settings have AUC >= 0.5, total bulk assignments >= 500, and total true assignments >= 300 across all four data sets. If total bulk assignments < 500 or total true assignments < 300 or AUC < 0.5, AUCs are set 0. (a) Mouse brain. (b) Mouse kidndy. (c) Arabidopsis root of single cell 9. (d) Arabidopsis root of single cell 51.
Figure 10: Random settings
AUCs of random settings on each validating data sets. A bar refers to one AUC of one random settings. If total bulk assignments < 500 or total true assignments < 300 or AUC < 0.5, AUCs are set 0. (a) Mouse brain. (b) Mouse kidndy. (c) Arabidopsis root of single cell 9. (d) Arabidopsis root of single cell 51.
| normalization | filtering.set | dimensionality.reduction | graph.building | spd.set |
|---|---|---|---|---|
| fct | fil1, fil2, fil3 | pca | knn, snn | s0.2p0.5d5 |
| fct | fil1, fil2, fil3 | pca | knn, snn | s0.2p0.6d6 |
| fct | fil1, fil2, fil3 | pca | knn, snn | s0.2p0.5d7 |
| fct | fil1, fil2, fil3 | pca | knn, snn | s0.2p0.2d6 |
| fct | fil1, fil2, fil3 | pca | knn, snn | s0.2p0.3d7 |
| fct | fil1, fil2, fil3 | pca | knn, snn | s0.2p0.8d13 |
| fct | fil1, fil2, fil3 | pca | knn, snn | s0.3p0.5d16 |
| fct | fil1, fil2, fil3 | pca | knn, snn | s0.3p0.4d14 |
| fct | fil1, fil2, fil3 | pca | knn, snn | s0.3p0.4d12 |
| fct | fil1, fil2, fil3 | pca | knn, snn | s0.3p0.4d13 |
| fct | fil1, fil2, fil3 | pca | knn, snn | s0.3p0.2d17 |
| fct | fil1, fil2, fil3 | pca | knn, snn | s0.3p0.7d17 |
| fct | fil1, fil2, fil3 | pca | knn, snn | s0.2p0.2d23 |
This section demonstrates co-visulization of bulk and single cells using the auto-matching functionality on mouse brain data. The bulk (Vacher et al. 2021) and single cell (Ortiz et al. 2020) RNA-seq data are from the same studies as in validating optimal settings. Both data sets are reduced for demonstration purpose. The optimal settings of fct, fil1, pca, knn, s0.2p0.8d12 in Table 2 are used.
To obtain reproducible results, always start a new R session and set a fixed seed for Random Number Generator at the beginning, which is required only once in each R session.
set.seed(10)
Read bulk and single cell data.
# Example bulk data.
blk.mus.pa <- system.file("extdata/shinyApp/example", "bulk_mouse_cocluster.txt", package="spatialHeatmap")
blk.mus <- as.matrix(read.table(blk.mus.pa, header=TRUE, row.names=1, sep='\t', check.names=FALSE))
blk.mus[1:3, 1:5]
## CERE.CORTEX HIPP HYPOTHA CERE CERE.CORTEX
## AI593442 177 256 50 24 285
## Actr3b 513 1465 228 244 666
## Adcy1 701 1243 57 1910 836
# Example single cell data.
sc.mus.pa <- system.file("extdata/shinyApp/example", "cell_mouse_cocluster.txt", package="spatialHeatmap")
sc.mus <- as.matrix(read.table(sc.mus.pa, header=TRUE, row.names=1, sep='\t', check.names=FALSE))
sc.mus[1:3, 1:5]
## isocort isocort isocort isocort olfa
## AI593442 2 2 2 5 0
## Actr3b 3 5 4 4 1
## Adcy1 3 6 6 6 0
Initial filtering.
blk.mus <- filter_data(data=blk.mus, sam.factor=NULL, con.factor=NULL, pOA=c(0.1, 5), CV=c(0.2, 100), verbose=FALSE)
mus.lis <- filter_cell(lis=list(sc.mus=sc.mus), bulk=blk.mus, gen.rm=NULL, min.cnt=1, p.in.cell=0.5, p.in.gen=0.1, verbose=FALSE)
Normalization. Bulk and single cell are combined and normalized, then separated.
To reduce runtime, some data are cached in ~/.cache/shm.
cache.pa <- '~/.cache/shm' # The path of cache.
mus.lis.nor <- read_cache(cache.pa, 'mus.lis.nor') # Retrieve data from cache.
if (is.null(mus.lis.nor)) { # Save normalized data to cache if not cached.
mus.lis.nor <- norm_multi(dat.lis=mus.lis, cpm=FALSE)
save_cache(dir=cache.pa, overwrite=TRUE, mus.lis.nor)
}
Secondary filtering.
blk.mus.fil <- filter_data(data=logcounts(mus.lis.nor$bulk), sam.factor=NULL, con.factor=NULL, pOA=c(0.1, 0.5), CV=c(0.2, 100), verbose=FALSE)
mus.lis.fil <- filter_cell(lis=list(sc.mus=logcounts(mus.lis.nor$sc.mus)), bulk=blk.mus.fil, gen.rm=NULL, min.cnt=1, p.in.cell=0.05, p.in.gen=0.02, verbose=FALSE)
The aSVG file of mouse brain.
svg.mus.brain <- system.file("extdata/shinyApp/example", "mus_musculus.brain.svg", package="spatialHeatmap")
# Spatial features.
feature.df <- return_feature(svg.path=svg.mus.brain)
## Accessing features...
## Element "a" is removed: a4174 !
##
## Recommendation: please remove the 'transform' attribute with a 'matrix' value in the following groups by ungrouping and regrouping the respective groups in Inkscape. Otherwise, colors in spatial heatmap might be shifted!
## EFO_0000530
##
## mus_musculus.brain.svg,
Matching table indicating true bulk tissues of each cell type and corresponding SVG bulk (spatial feature).
match.mus.brain.pa <- system.file("extdata/shinyApp/example", "match_mouse_brain_cocluster.txt", package="spatialHeatmap")
df.match.mus.brain <- read.table(match.mus.brain.pa, header=TRUE, row.names=1, sep='\t')
df.match.mus.brain
## SVGBulk cell trueBulk
## 1 cerebellum cere CERE
## 2 hippocampus hipp HIPP
## 3 cerebral.cortex isocort CERE.CORTEX
## 4 hippocampus retrohipp HIPP
## 5 hypothalamus hypotha HYPOTHA
The SVG bulk tissues are in the aSVG file.
df.match.mus.brain$SVGBulk %in% feature.df$feature
## [1] TRUE TRUE TRUE TRUE TRUE
Cluster single cells. Cluster labels are stored in label column in colData.
clus.sc <- read_cache(cache.pa, 'clus.sc') # Retrieve data from cache.
if (is.null(clus.sc)) {
clus.sc <- cluster_cell(data=mus.lis.fil$sc.mus, min.dim=10, max.dim=50, graph.meth='knn', dimred='PCA')
save_cache(dir=cache.pa, overwrite=TRUE, clus.sc)
}
colData(clus.sc)[1:3, ]
## DataFrame with 3 rows and 2 columns
## label cell
## <character> <character>
## isocort 17 isocort
## isocort 12 isocort
## isocort 12 isocort
Refine cell clusters.
cell.refined <- refine_cluster(clus.sc, sim=0.2, sim.p=0.8, sim.meth='spearman', verbose=FALSE)
Include matching information in colData.
cell.refined <- true_bulk(cell.refined, df.match.mus.brain)
colData(cell.refined)[1:3, ]
## DataFrame with 3 rows and 5 columns
## label cell index.all trueBulk SVGBulk
## <character> <character> <integer> <character> <character>
## isocort 1 isocort 354 CERE.CORTEX cerebral.cortex
## corti.sub 1 corti.sub 1491 none none
## hipp 1 hipp 2372 HIPP hippocampus
Cocluster bulk and single cells.
roc.lis <- read_cache(cache.pa, 'roc.lis') # Retrieve data from cache.
if (is.null(roc.lis)) {
roc.lis <- coclus_roc(bulk=mus.lis.fil$bulk, cell.refined=cell.refined, df.match=df.match.mus.brain, min.dim=13, graph.meth='knn', dimred='PCA')
save_cache(dir=cache.pa, overwrite=TRUE, roc.lis)
}
The auto-matching results are listed in roc.lis$df.roc. predictor is the similarity (Spearman's or Pearson's correlation coefficient) between bulk and cells within a co-cluster, which is used to assign bulk tissues to cells (Figure 4.8). response indicates whether the bulk assignment is correct according to the matching table. index is the cell index in the SingleCellExperiment after cell clusters are refined.
roc.lis$df.roc[1:3, ]
## assignedBulk cell response predictor index trueBulk SVGBulk
## HYPOTHA HYPOTHA hypotha TRUE 0.4285714 2335 HYPOTHA hypothalamus
## HYPOTHA1 HYPOTHA hypotha TRUE 0.5219780 2380 HYPOTHA hypothalamus
## HYPOTHA2 HYPOTHA hypotha TRUE 0.7802198 2396 HYPOTHA hypothalamus
table(roc.lis$df.roc$response)
##
## FALSE TRUE
## 28 637
ROC curve is created according to roc.lis$df.roc and the AUC value indicates the auto-matching performance.
plot(roc.lis$roc.obj, print.auc=TRUE)
Figure 11: ROC curve of auto-matching on mouse brain data
The AUC value is a measure of accuracy on bulk assignments.
Incorporate cell.refined in roc.lis for downstream co-visualization.
res.lis <- c(list(cell.refined=cell.refined), roc.lis)
The processes of clustering single cells, refining cell clusters, and coclustering bulk and single cells can be performed in a single function call. Setting return.all=TRUE returns a list of refined cell clusters, ROC object, and a data.frame of auto-matching results. If return.all=FALSE, a data.frame of parameter settings and resulting AUC is returned.
res.lis <- cocluster(bulk=mus.lis.fil$bulk, cell=mus.lis.fil$sc.mus, df.match=df.match.mus.brain, df.para=NULL, sim=0.2, sim.p=0.8, dim=13, graph.meth='knn', dimred='PCA', sim.meth='spearman', return.all=TRUE)
res.lis <- res.lis[[1]]
The function cocluster accepts multiple combinations of parameter settings provided in a data.frame, and coclustering on these combinations can be performed in parallel on multiple cpu cores through multi.core.par.
Multiple combinations of parameter settings. If some parameters are not specified in this table such as graph.meth, their default settings will be used.
df.par <- data.frame(sim=c(0.2, 0.3), sim.p=c(0.8, 0.7), dim=c(12, 13))
df.par
## sim sim.p dim
## 1 0.2 0.8 12
## 2 0.3 0.7 13
The coclustering is run on 2 cpu cores (workers=2).
res.multi <- cocluster(bulk=mus.lis.fil$bulk, cell=mus.lis.fil$sc.mus, df.match=df.match.mus.brain, df.para=df.par, sc.dim.min=10, max.dim=50, sim=0.2, sim.p=0.8, dim=13, graph.meth='knn', dimred='PCA', sim.meth='spearman', return.all=TRUE, multi.core.par=MulticoreParam(workers=2))
The auto-matching results through coclustering can be tailored through "Lasso Select" on the convenience Shiny app (desired_bulk_shiny) or manually defining desired bulk. If the former, save cell.refined in an .rds file by saveRDS(cell.refined, file='cell.refined.rds') and upload cell.refined.rds to the Shiny app.
Example of desired bulk downloaded from the convenience Shiny app.
desired.blk.pa <- system.file("extdata/shinyApp/example", "selected_cells_with_desired_bulk.txt", package="spatialHeatmap")
df.desired.bulk <- read.table(desired.blk.pa, header=TRUE, row.names=1, sep='\t')
df.desired.bulk[1:3, ]
## x y key desiredSVGBulk dimred
## 1 4.389422 6.917510 62 cerebellum PCA
## 2 4.586877 8.077207 75 cerebellum PCA
## 3 6.366188 7.695782 76 cerebellum PCA
Before manually defining desired bulk, check cells in the embedding plot.
plot_dim(res.lis$cell.refined, dim='PCA', color.by='cell', x.break=seq(-10, 10, 2), y.break=seq(-10, 10, 2))
Figure 12: PCA embedding plot of mouse brain single cell data
Single cell data after cluster refining are plotted.
Manually define desired bulk for certain cells by x-y coordinates ranges in the embedding plot. The dimred reveals where the coordinates come from and are required.
df.desired.bulk <- data.frame(x.min=c(2, 6), x.max=c(4, 10), y.min=c(6, 8), y.max=c(8, 10), desiredSVGBulk=c('cerebral.cortex', 'cerebral.cortex'), dimred='PCA')
df.desired.bulk
## x.min x.max y.min y.max desiredSVGBulk dimred
## 1 2 4 6 8 cerebral.cortex PCA
## 2 6 10 8 10 cerebral.cortex PCA
Extract cells with bulk assignments. If df.desired.bulk is provided a value, the corrresponding assignments are incorporated in res.lis$df.roc, and response and predictor is set TRUE and 1 respectively. thr is a cutoff for the predictor in res.lis$df.roc, so thr=0 denotes predictor is not filtered. true.only=TRUE indicates only true assignments are extracted.
sce.lis <- sub_asg(res.lis=res.lis, thr=0, df.desired.bulk=df.desired.bulk, df.match=df.match.mus.brain, true.only=TRUE)
## No cells selected for row: 2!
Aggregate extracted cells by SVGBulk. The aggregated cells are equivalent to bulk tissues (spatial features) in the aSVG. The aggregated abundance profiles are used to color matching bulk tissues in the aSVG image.
sce.aggr <- aggr_rep(data=sce.lis$cell.sub, assay.na='logcounts', sam.factor='SVGBulk', con.factor=NULL, aggr='mean')
Co-visualize bulk and single cells with aggregated abundance profiles of gene Adcy1. tar.bulk refers to the target bulk in aSVG and all corresponding cells are highlighted. Cells with true assignments of tar.bulk are colored according to the color key, while other corresponding cells with false or without assignments are colored black. All other cells not corresponding to tar.bulk are in gray.
shm.lis1 <- spatial_hm(svg.path=svg.mus.brain, data=sce.aggr, ID=c('Adcy1'), legend.nrow=4, sce.dimred=sce.lis$cell.refined, dimred='PCA', assay.na='logcounts', tar.bulk=c('cerebral.cortex'), profile=TRUE, dim.lgd.text.size=10, dim.lgd.nrow=1, bar.width=0.1)
## Coordinates: mus_musculus.brain.svg ...
## CPU cores: 1
## Element "a" is removed: a4174 !
##
## Recommendation: please remove the 'transform' attribute with a 'matrix' value in the following groups by ungrouping and regrouping the respective groups in Inkscape. Otherwise, colors in spatial heatmap might be shifted!
## EFO_0000530
##
##
## Potential error detected in these elements: 'EFO_0000530;EFO_0000530'! If they are groups, please remove the 'transform' attribute with a 'matrix' value by ungrouping and regrouping the respective groups in Inkscape. If individual paths, consider deleting them in Inkscape. Otherwise, colors in spatial heatmap might be shifted!
## ggplots/grobs: mus_musculus.brain.svg ...
## ggplot: Adcy1, con
## Legend plot ...
## CPU cores: 1
## Converting "ggplot" to "grob" ...
## Adcy1_con_1
## Converting "ggplot" to "grob" ...
##
## The reduced dimensionality in "data" is used: PCA .
## Converting "ggplot" to "grob" ...
## dim_Adcy1_con_1
Figure 13: Co-visualizing bulk and single cells of mouse brain with abundance profiles
The aggregated expression profiles of gene Adcy1 are plotted.
Co-visualize bulk and single cells without abundance profiles.
shm.lis2 <- spatial_hm(svg.path=svg.mus.brain, data=sce.aggr, ID=c('Adcy1'), legend.nrow=4, sce.dimred=sce.lis$cell.refined, dimred='PCA', tar.bulk=c('cerebral.cortex'), profile=FALSE, dim.lgd.text.size=10, dim.lgd.nrow=1)
## Coordinates: mus_musculus.brain.svg ...
## CPU cores: 1
## Element "a" is removed: a4174 !
##
## Recommendation: please remove the 'transform' attribute with a 'matrix' value in the following groups by ungrouping and regrouping the respective groups in Inkscape. Otherwise, colors in spatial heatmap might be shifted!
## EFO_0000530
##
##
## Potential error detected in these elements: 'EFO_0000530;EFO_0000530'! If they are groups, please remove the 'transform' attribute with a 'matrix' value by ungrouping and regrouping the respective groups in Inkscape. If individual paths, consider deleting them in Inkscape. Otherwise, colors in spatial heatmap might be shifted!
## ggplots/grobs: mus_musculus.brain.svg ...
## ggplot: Adcy1, con
## Legend plot ...
## CPU cores: 1
## Converting "ggplot" to "grob" ...
## Adcy1_con_1
## Converting "ggplot" to "grob" ...
##
## The reduced dimensionality in "data" is used: PCA .
## Converting "ggplot" to "grob" ...
## dim_mus_musculus.brain.svg
Figure 14: Co-visualizing bulk and single cells of mouse brain without abundance profiles
The matching between cells and SVG bulk is denoted by color in-between.
The auto-matching utility is included in the Shiny app. To use it, the bulk, single cell data, and matching table need to be stored in a SingleCellExperiment object. Bulk and single cell raw count data are combined and stored in the assay slot and are labeled by bulk and cell respectively by the column bulkCell in colData slot. The matching table is stored in the metadata list with the name df.match. The example below illustrates these rules.
sce.auto <- readRDS(system.file("extdata/shinyApp/example", 'sce_auto_bulk_cell_mouse_brain.rds', package="spatialHeatmap"))
colData(sce.auto)
## DataFrame with 4466 rows and 1 column
## bulkCell
## <character>
## CERE.CORTEX bulk
## HIPP bulk
## HYPOTHA bulk
## CERE bulk
## CERE.CORTEX bulk
## ... ...
## retrohipp cell
## retrohipp cell
## retrohipp cell
## retrohipp cell
## retrohipp cell
metadata(sce.auto)$df.match
## SVGBulk cell trueBulk
## 1 cerebellum cere CERE
## 2 hippocampus hipp HIPP
## 3 cerebral.cortex isocort CERE.CORTEX
## 4 hippocampus retrohipp HIPP
## 5 hypothalamus hypotha HYPOTHA
sessionInfo()
## R version 4.2.0 (2022-04-22)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 16.04.5 LTS
##
## Matrix products: default
## BLAS: /usr/lib/libblas/libblas.so.3.6.0
## LAPACK: /usr/lib/lapack/liblapack.so.3.6.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] biomaRt_2.52.0 BiocParallel_1.30.0
## [3] igraph_1.3.1 scater_1.24.0
## [5] ggplot2_3.3.6 scran_1.24.0
## [7] scuttle_1.6.0 SingleCellExperiment_1.18.0
## [9] GEOquery_2.64.0 ExpressionAtlas_1.24.0
## [11] xml2_1.3.3 limma_3.52.0
## [13] SummarizedExperiment_1.26.1 Biobase_2.56.0
## [15] GenomicRanges_1.48.0 GenomeInfoDb_1.32.1
## [17] IRanges_2.30.0 S4Vectors_0.34.0
## [19] BiocGenerics_0.42.0 MatrixGenerics_1.8.0
## [21] matrixStats_0.62.0 spatialHeatmap_2.1.3
## [23] knitr_1.39 BiocStyle_2.24.0
## [25] nvimcom_0.9-25
##
## loaded via a namespace (and not attached):
## [1] rappdirs_0.3.3 tidyr_1.2.0
## [3] bit64_4.0.5 irlba_2.3.5
## [5] DelayedArray_0.22.0 data.table_1.14.2
## [7] rpart_4.1.16 KEGGREST_1.36.0
## [9] RCurl_1.98-1.6 doParallel_1.0.17
## [11] generics_0.1.2 preprocessCore_1.58.0
## [13] ScaledMatrix_1.4.0 cowplot_1.1.1
## [15] RSQLite_2.2.13 bit_4.0.4
## [17] tzdb_0.3.0 httpuv_1.6.5
## [19] assertthat_0.2.1 viridis_0.6.2
## [21] xfun_0.30 hms_1.1.1
## [23] jquerylib_0.1.4 evaluate_0.15
## [25] promises_1.2.0.1 fansi_1.0.3
## [27] progress_1.2.2 caTools_1.18.2
## [29] dbplyr_2.1.1 DBI_1.1.2
## [31] geneplotter_1.74.0 htmlwidgets_1.5.4
## [33] purrr_0.3.4 ellipsis_0.3.2
## [35] RSpectra_0.16-1 dplyr_1.0.9
## [37] backports_1.4.1 bookdown_0.26
## [39] annotate_1.74.0 sparseMatrixStats_1.8.0
## [41] vctrs_0.4.1 cachem_1.0.6
## [43] withr_2.5.0 checkmate_2.1.0
## [45] prettyunits_1.1.1 cluster_2.1.3
## [47] grImport_0.9-5 lazyeval_0.2.2
## [49] crayon_1.5.1 genefilter_1.78.0
## [51] edgeR_3.38.0 pkgconfig_2.0.3
## [53] labeling_0.4.2 vipor_0.4.5
## [55] nnet_7.3-17 rlang_1.0.2
## [57] lifecycle_1.0.1 filelock_1.0.2
## [59] BiocFileCache_2.4.0 rsvd_1.0.5
## [61] av_0.7.0 rsvg_2.3.1
## [63] rngtools_1.5.2 Matrix_1.4-1
## [65] Rhdf5lib_1.18.0 base64enc_0.1-3
## [67] beeswarm_0.4.0 png_0.1-7
## [69] viridisLite_0.4.0 bitops_1.0-7
## [71] shinydashboard_0.7.2 KernSmooth_2.23-20
## [73] visNetwork_2.1.0 rhdf5filters_1.8.0
## [75] pROC_1.18.0 Biostrings_2.64.0
## [77] blob_1.2.3 DelayedMatrixStats_1.18.0
## [79] doRNG_1.8.2 stringr_1.4.0
## [81] readr_2.1.2 jpeg_0.1-9
## [83] gridGraphics_0.5-1 rols_2.24.0
## [85] beachmat_2.12.0 scales_1.2.0
## [87] memoise_2.0.1 magrittr_2.0.3
## [89] plyr_1.8.7 gplots_3.1.3
## [91] distinct_1.8.0 zlibbioc_1.42.0
## [93] compiler_4.2.0 dqrng_0.3.0
## [95] RColorBrewer_1.1-3 DESeq2_1.36.0
## [97] cli_3.3.0 XVector_0.36.0
## [99] htmlTable_2.4.0 Formula_1.2-4
## [101] MASS_7.3-57 WGCNA_1.71
## [103] tidyselect_1.1.2 stringi_1.7.6
## [105] highr_0.9 yaml_2.3.5
## [107] BiocSingular_1.12.0 locfit_1.5-9.5
## [109] latticeExtra_0.6-29 ggrepel_0.9.1
## [111] grid_4.2.0 sass_0.4.1
## [113] tools_4.2.0 parallel_4.2.0
## [115] rstudioapi_0.13 bluster_1.6.0
## [117] foreach_1.5.2 foreign_0.8-82
## [119] metapod_1.4.0 gridExtra_2.3
## [121] farver_2.1.0 Rtsne_0.16
## [123] digest_0.6.29 BiocManager_1.30.17
## [125] FNN_1.1.3 shiny_1.7.1
## [127] Rcpp_1.0.8.3 later_1.3.0
## [129] httr_1.4.3 ggdendro_0.1.23
## [131] AnnotationDbi_1.58.0 colorspace_2.0-3
## [133] XML_3.99-0.9 splines_4.2.0
## [135] uwot_0.1.11 yulab.utils_0.0.4
## [137] statmod_1.4.36 ggplotify_0.1.0
## [139] plotly_4.10.0 xtable_1.8-4
## [141] jsonlite_1.8.0 dynamicTreeCut_1.63-1
## [143] UpSetR_1.4.0 flashClust_1.01-2
## [145] R6_2.5.1 Hmisc_4.7-0
## [147] pillar_1.7.0 htmltools_0.5.2
## [149] mime_0.12 glue_1.6.2
## [151] fastmap_1.1.0 BiocNeighbors_1.14.0
## [153] codetools_0.2-18 utf8_1.2.2
## [155] lattice_0.20-45 bslib_0.3.1
## [157] tibble_3.1.7 curl_4.3.2
## [159] ggbeeswarm_0.6.0 gtools_3.9.2
## [161] magick_2.7.3 GO.db_3.15.0
## [163] survival_3.3-1 rmarkdown_2.14
## [165] munsell_0.5.0 fastcluster_1.2.3
## [167] rhdf5_2.40.0 GenomeInfoDbData_1.2.8
## [169] iterators_1.0.14 HDF5Array_1.24.0
## [171] impute_1.70.0 reshape2_1.4.4
## [173] gtable_0.3.0
This project has been funded by NSF awards: PGRP-1546879, PGRP-1810468, PGRP-1936492.
Backman, Tyler W. H., and Thomas Girke. 2016. “SystemPipeR: NGS Workflow and Report Generation Environment.” BMC Bioinformatics 17 (1). Springer Science; Business Media LLC. doi:10.1186/s12859-016-1241-0.
Chang, Winston, and Barbara Borges Ribeiro. 2018. Shinydashboard: Create Dashboards with ’Shiny’. https://CRAN.R-project.org/package=shinydashboard.
Chang, Winston, Joe Cheng, JJ Allaire, Cars on Sievert, Barret Schloerke, Yihui Xie, Jeff Allen, Jonathan McPherson, Alan Dipert, and Barbara Borges. 2021. Shiny: Web Application Framework for R. https://CRAN.R-project.org/package=shiny.
Chen, Lihe, Jae Wook Lee, Chung-Lin Chou, Anil V Nair, Maria A Battistone, Teodor G Păunescu, Maria Merkulova, et al. 2017. “Transcriptomes of Major Renal Collecting Duct Cell Types in Mouse Identified by Single-Cell RNA-seq.” Proc. Natl. Acad. Sci. U. S. A. 114 (46): E9989–E9998.
Clark, Jevin Z, Lihe Chen, Chung-Lin Chou, Hyun Jun Jung, Jae Wook Lee, and Mark A Knepper. 2019. “Representation and Relative Abundance of Cell-Type Selective Markers in Whole-Kidney RNA-Seq Data.” Kidney Int. 95 (4): 787–96.
Karaiskos, Nikos, Mahdieh Rahmatollahi, Anastasiya Boltengagen, Haiyue Liu, Martin Hoehne, Markus Rinschen, Bernhard Schermer, et al. 2018. “A Single-Cell Transcriptome Atlas of the Mouse Glomerulus.” J. Am. Soc. Nephrol. 29 (8): 2060–8.
Li, Song, Masashi Yamada, Xinwei Han, Uwe Ohler, and Philip N Benfey. 2016. “High-Resolution Expression Map of the Arabidopsis Root Reveals Alternative Splicing and lincRNA Regulation.” Dev. Cell 39 (4): 508–22.
Lun, Aaron T. L., Davis J. McCarthy, and John C. Marioni. 2016. “A Step-by-Step Workflow for Low-Level Analysis of Single-Cell RNA-Seq Data with Bioconductor.” F1000Res. 5: 2122. doi:10.12688/f1000research.9501.2.
Marques, Sueli, Amit Zeisel, Simone Codeluppi, David van Bruggen, Ana Mendanha Falcão, Lin Xiao, Huiliang Li, et al. 2016. “Oligodendrocyte Heterogeneity in the Mouse Juvenile and Adult Central Nervous System.” Science 352 (6291): 1326–9.
Morgan, Martin, Jiefei Wang, Valerie Obenchain, Michel Lang, Ryan Thompson, and Nitesh Turaga. 2021. BiocParallel: Bioconductor Facilities for Parallel Evaluation. https://github.com/Bioconductor/BiocParallel.
Ortiz, Cantin, Jose Fernandez Navarro, Aleksandra Jurek, Antje Märtin, Joakim Lundeberg, and Konstantinos Meletis. 2020. “Molecular Atlas of the Adult Mouse Brain.” Science Advances 6 (26). American Association for the Advancement of Science: eabb3446.
Park, Jihwan, Rojesh Shrestha, Chengxiang Qiu, Ayano Kondo, Shizheng Huang, Max Werth, Mingyao Li, Jonathan Barasch, and Katalin Suszták. 2018. “Single-Cell Transcriptomics of the Mouse Kidney Reveals Potential Cellular Targets of Kidney Disease.” Science 360 (6390): 758–63.
Shahan, Rachel, Che-Wei Hsu, Trevor M Nolan, Benjamin J Cole, Isaiah W Taylor, Anna Hendrika Cornelia Vlot, Philip N Benfey, and Uwe Ohler. 2020. “A Single Cell Arabidopsis Root Atlas Reveals Developmental Trajectories in Wild Type and Cell Identity Mutants.” BioRxiv.
Vacher, Claire-Marie, Helene Lacaille, Jiaqi J O’Reilly, Jacquelyn Salzbank, Dana Bakalar, Sonia Sebaoui, Philippe Liere, et al. 2021. “Placental Endocrine Function Shapes Cerebellar Development and Social Behavior.” Nat. Neurosci. 24 (10). Nature Publishing Group: 1392–1401.